Enable exporting monthly invoice to Iceberg by QuanMPhm · Pull Request #292 · CCI-MOC/invoicing

QuanMPhm · 2026-04-20T18:06:18Z

Closes #259. This PR is in draft, since I'm not too clear which dataframe we want to export. Also, I want consensus on the datatypes of the columns we want in the iceberg table. It seems this PR will depend on #285.

@knikolla @naved001 Regardless, for now, this draft should be enough to show you guys what the iceberg operations involved will look like, and the structure of code changes that will need to happen

jimmysway · 2026-05-07T17:22:18Z

Aren't the dataframes that we'll be exporting just the processed billable and non-billable dataframes?

jimmysway · 2026-05-07T17:28:49Z

Is Billable, Cluster Assignment Source, Cluster Assignment Method

Are three new columns that have been created from the backfilling process. going forward. We can just set Cluster Assignment Source and Cluster Assignment method to Something like "Regular Monthly Invoicing Cycle" or something like that.

There is already an Is Billable column or flag or indicator somewhere in the invoicing code it is just a matter of exporting that as well instead of separating billable from non-billable. I think the plan was to ideally export the entire combined dataframe (billable + nonbillable) to the iceberg.

@QuanMPhm @knikolla

QuanMPhm · 2026-05-07T18:29:20Z

We can just set Cluster Assignment Source and Cluster Assignment method to Something like "Regular Monthly Invoicing Cycle" or something like that.

I am hesistant to include those columns in the Iceberg table. I believe they should just be removed, since they're not useful for anything in invoicing. We had those columns in the first place because it helped in constructing the initial iceberg table.

As for what invoice to export, I fine with exporting the entire "main" dataframe, since that's equivalent to billable + nonbillable, and it is also what's currently in the initial iceberg table.

@knikolla @naved001 I'll wait for your review

knikolla · 2026-05-07T18:39:20Z

We shouldn't have Cluster Assignment Source, Cluster Assignment Method in the Iceberg table

Is Billable should be there.

During 2026-03 invoicing, a bug was found where the columns initialized by the New-PI credit processor (i.e `PI Balance` column), was being accessed by the PI-SU processor before it was initialized, causing an KeyError. To fix this, the codebase has been refactored to allow each processor to explicitly document which columns they initialize and use, defined in two new properties, `initializes_columns` and `operates_on_columns`. A helper function `_init_columns()` is added to initalize columns Unit test `tests/unit/processors/test_processor_list.py` is added to check each processor only uses columns that itself or previous processors initialized, and no column is initialized more than once Additionally, each column will now be encapsulated as a `InvoiceColumn` instance. `InvoiceColumn` contains the name, datatype, and default values for each column This will also enable stricter and clearer type enforcement for data entering and leaving the pipeline A new processor `ValidateInputColumnsProcessor` is added to check the input dataframe to the processing pipeline has prerequisite columns, and to cast to appropriate types The e2e test data has been updated to surface the bug that was found. It did not failed during the PR that introduced the bug [1] because the test data didn't have the right conditions to trigger the PI-SU processor Refactored unit tests to accomodate the new processor by adding a new base test class. [1] CCI-MOC#279

QuanMPhm · 2026-06-04T16:37:54Z

@knikolla @naved001 @jimmysway I now consider this PR ready for review. I have rebased it on top of the column tracking PR, as the current state of the e2e test won't allow the invoice to be exported to iceberg easily. Specifically, Iceberg does not allow exporting of columns with a pa.null type, which happens when a column only contains None. The current e2e test does cause some columns to only be None. Since the column tracking PR ensures every column has a non-null type, the issue is removed.

Added new invoice `IcebergInvoice` to export invoice data to Iceberg tables The export process also includes a schema update step to allow updates to Iceberg table schema. New Iceberg integration test added to validate iceberg functionality E2E test updated to include iceberg exporting Both tests use a temporary sqlite catalog

QuanMPhm · 2026-06-12T18:24:32Z

@knikolla @naved001 @jimmysway I just learned that pyiceberg does not support interfacing with hadoop catalogs, which is what our current iceberg table in s3 is using. So I'll need to rewrite my PR to use something like PySpark.

QuanMPhm requested review from knikolla and naved001 April 20, 2026 18:06

QuanMPhm marked this pull request as draft April 20, 2026 18:06

QuanMPhm mentioned this pull request Apr 20, 2026

Track initialized and used columns in each processor #285

Merged

QuanMPhm force-pushed the 259/export_iceberg branch from ad4efbf to bfe5338 Compare June 4, 2026 16:37

QuanMPhm marked this pull request as ready for review June 4, 2026 16:37

QuanMPhm requested review from jimmysway and larsks June 4, 2026 16:37

QuanMPhm force-pushed the 259/export_iceberg branch from bfe5338 to bfbb760 Compare June 4, 2026 16:49

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Enable exporting monthly invoice to Iceberg#292

Enable exporting monthly invoice to Iceberg#292
QuanMPhm wants to merge 2 commits into
CCI-MOC:mainfrom
QuanMPhm:259/export_iceberg

QuanMPhm commented Apr 20, 2026

Uh oh!

jimmysway commented May 7, 2026

Uh oh!

jimmysway commented May 7, 2026

Uh oh!

QuanMPhm commented May 7, 2026

Uh oh!

knikolla commented May 7, 2026

Uh oh!

QuanMPhm commented Jun 4, 2026 •

edited

Loading

Uh oh!

QuanMPhm commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

QuanMPhm commented Apr 20, 2026

Uh oh!

jimmysway commented May 7, 2026

Uh oh!

jimmysway commented May 7, 2026

Uh oh!

QuanMPhm commented May 7, 2026

Uh oh!

knikolla commented May 7, 2026

Uh oh!

QuanMPhm commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

QuanMPhm commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

QuanMPhm commented Jun 4, 2026 •

edited

Loading